Online speech synthesis using a chronically implanted brain–computer interface in an individual with ALS

Brain–computer interfaces (BCIs) that reconstruct and synthesize speech using brain activity recorded with intracranial electrodes may pave the way toward novel communication interfaces for people who have lost their ability to speak, or who are at high risk of losing this ability, due to neurological disorders. Here, we report online synthesis of intelligible words using a chronically implanted brain-computer interface (BCI) in a man with impaired articulation due to ALS, participating in a clinical trial (ClinicalTrials.gov, NCT03567213) exploring different strategies for BCI communication. The 3-stage approach reported here relies on recurrent neural networks to identify, decode and synthesize speech from electrocorticographic (ECoG) signals acquired across motor, premotor and somatosensory cortices. We demonstrate a reliable BCI that synthesizes commands freely chosen and spoken by the participant from a vocabulary of 6 keywords previously used for decoding commands to control a communication board. Evaluation of the intelligibility of the synthesized speech indicates that 80% of the words can be correctly recognized by human listeners. Our results show that a speech-impaired individual with ALS can use a chronically implanted BCI to reliably produce synthesized words while preserving the participant’s voice profile, and provide further evidence for the stability of ECoG for speech-based BCIs.

speech production (see Supplementary Fig. 2). From the raw ECoG signals, our closed-loop speech synthesizer extracted broadband high-gamma power features (70-170 Hz) that had previously been demonstrated to encode speech-related information useful for decoding speech (Fig. 1B) 10,14.
We used a unidirectional RNN to identify and buffer sequences of high-gamma activity frames and extract speech segments (Fig. 1C,D). This neural voice activity detection (nVAD) model internally employed a strategy to correct misclassified frames based on each frame's temporal context, and additionally included a context window of 0.5 s to allow for smoother transitions between speech and non-speech frames. Each buffered sequence was forwarded to a bidirectional decoding model that mapped high-gamma features onto 18 Bark-scale cepstral coefficients 34 and 2 pitch parameters, henceforth referred to as LPC coefficients 35,36 (Fig. 1E,F). We used a bidirectional architecture to include past and future information while making frame-wise predictions. Estimated LPC coefficients were transformed into an acoustic speech signal using the LPCNet vocoder 36 and played back as delayed auditory feedback (Fig. 1G).
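The buffering step described above can be sketched as a small generator that collects frames while the detector reports speech and emits one segment per utterance. This is only an illustration under stated assumptions: the function name and the `hangover` parameter (a tolerance for brief non-speech gaps) are hypothetical and do not reproduce the frame-correction strategy or the 0.5 s context window of the nVAD model.

```python
from typing import Iterable, Iterator, List, Sequence

Frame = Sequence[float]  # one high-gamma feature vector (one value per channel)


def buffer_speech_segments(
    frames: Iterable[Frame],
    is_speech: Iterable[bool],
    hangover: int = 0,
) -> Iterator[List[Frame]]:
    """Yield one buffered segment per detected utterance.

    `hangover` (hypothetical) is the number of consecutive non-speech
    frames tolerated inside a segment before the segment is closed.
    """
    segment: List[Frame] = []
    silent_run = 0
    for frame, speech in zip(frames, is_speech):
        if speech:
            segment.append(frame)
            silent_run = 0
        elif segment:
            # Non-speech frame inside an open segment: keep it for now.
            silent_run += 1
            segment.append(frame)
            if silent_run > hangover:
                # Close the segment and drop the trailing non-speech frames.
                yield segment[:-silent_run]
                segment, silent_run = [], 0
    if segment:
        # Flush a segment that is still open when the stream ends.
        yield segment[: len(segment) - silent_run]
```

In a streaming setting, each yielded segment would be handed to the decoding model as soon as it is closed, which is why the audible feedback can only begin after the word has been spoken.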

Synthesis performance
When deployed in sessions with the participant for online decoding, our speech-synthesis BCI was reliably capable of producing acoustic speech that captured many details and characteristics of the voice and pacing of the participant's natural speech, often yielding a close resemblance to the words spoken in isolation from the participant. Figure 2A provides examples of original and synthesized waveforms for a representative selection of words time-aligned by subtracting the duration of the extracted speech segment from the nVAD. Onset timings from the reconstructed waveforms indicate that the decoding model captured the flow of the spoken word while also synthesizing silence around utterances for smoother transitions. A comparison between voice activity for spoken and synthesized speech revealed a median Levenshtein distance of 235 ms, hinting that the synthesis approach was capable of generating speech that adequately matched the timing of the spoken counterpart. Figure 2B shows the corresponding acoustic spectrograms for the spoken and synthesized words, respectively. The spectral structures of the original and synthesized speech shared many common characteristics and achieved average correlation scores of 0.67 (± 0.18 standard deviation), suggesting that phoneme and formant-specific information were preserved.
We conducted 3 sessions across 3 different days (approximately 5 and a half months after the training data was acquired, each session lasted 6 min) to repeat the experiment with acoustic feedback from the BCI to the participant.
To assess the intelligibility of the synthesized words, we conducted listening tests in which human listeners played back individual samples of the synthesized words and selected the word that most closely resembled each sample. Additionally, we mixed in samples that contained the originally spoken words. This allowed us to assess the quality of the participant's natural speech. We recruited a cohort of 21 native English speakers to listen to all samples that were produced during our 3 closed-loop sessions. Out of 180 samples, we excluded 2 words because the nVAD model did not detect speech activity and therefore no speech output was produced by the decoding model. We also excluded a few cases where speech activity was falsely detected by the nVAD model, which resulted in synthesized silence and remained unnoticed to the participant.
Overall, human listeners achieved an accuracy score of 80%, indicating that the majority of synthesized words could be correctly and reliably recognized. Figure 2C presents the confusion matrix regarding only the synthesized samples where the ground truth labels and human listener choices are displayed on the X- and Y-axes, respectively. The confusion matrix shows that human listeners were able to recognize all but one word at very high rates. "Back" was recognized at low rates, albeit still above chance, and was most often mistaken for "Left". This could have been due in part to the close proximity of the vowel formant frequencies for these two words. The participant's weak tongue movements may have deemphasized the acoustic discriminability of these words, in turn resulting in the vocoder synthesizing a version of "back" that was often indistinct from "left". In contrast, the confusion matrix also shows that human listeners were confident in distinguishing the words "Up" and "Left". The decoder synthesized an intelligible but incorrect word in only 4% of the cases, and all listeners accurately recognized the incorrect word. Note that all keywords in the vocabulary were chosen for intuitive command and control of a computer interface, for example a communication board, and were not designed to be easily discriminable for BCI applications.
Figure 2D summarizes individual accuracy scores from all human listeners from the listening test in a histogram. All listeners recognized between 75 and 84% of the synthesized words. All human listeners achieved accuracy scores above chance (16.7%). In contrast, when tested on the participant's natural speech, our human listeners correctly recognized almost all samples of the 6 keywords (99.8%).

Anatomical and temporal contributions
In order to understand which cortical areas contributed to identification of speech segments, we conducted a saliency analysis 37 to reveal the underlying dynamics in high-gamma activity changes that explain the binary decisions made by our nVAD model. We utilized a method from the image processing domain 38 that queries spatial information indicating which pixels have contributed to a classification task. In our case, this method ranked individual high-gamma features over time by their influence on the predicted speech onsets (PSO). We defined the PSO as the first occurrence when the nVAD model identified spoken speech and neural data started to get buffered before being forwarded to the decoding model. The absolute values of their gradients allowed interpretations of which contributions had the highest or lowest impact on the class scores from anatomical and temporal perspectives.
The general idea is illustrated in Fig. 3B. In a forward pass, we first estimated for each trial the PSO by propagating through each time step until the nVAD model made a positive prediction. From here, we then applied backpropagation through time to compute all gradients with respect to the model's input high-gamma features. Relevance scores |R| were computed by taking the absolute value of each partial derivative and the maximum value across time was used as the final score for each electrode 38. Note that we only performed backpropagation through time for each PSO, and not for whole speech segments.
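The reduction from gradients to per-electrode relevance scores can be illustrated with a short NumPy sketch. It assumes the gradients of the speech class score at the PSO with respect to the input features have already been obtained by backpropagation through time (that step is not shown), that they are stored as a frames-by-channels array whose last row is the PSO frame, and that frames are 10 ms apart. The function name is hypothetical.

```python
import numpy as np

FRAME_SHIFT_MS = 10  # frame shift of the high-gamma features


def relevance_from_gradients(grads: np.ndarray):
    """Reduce input gradients to one relevance score per channel.

    grads: array of shape (n_frames, n_channels); the last row is assumed
    to be the frame at which the speech onset was predicted (PSO).
    Returns (scores, peak_ms): the maximum |gradient| per channel and the
    time of that maximum relative to the PSO in ms (zero or negative).
    """
    relevance = np.abs(grads)               # |R| for every frame and channel
    scores = relevance.max(axis=0)          # maximum over time per channel
    peak_frames = relevance.argmax(axis=0)  # frame index of that maximum
    peak_ms = (peak_frames - (grads.shape[0] - 1)) * FRAME_SHIFT_MS
    return scores, peak_ms
```

The two returned arrays correspond to the two quantities visualized per electrode: the magnitude of the influence and the time at which it peaks before the PSO.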
Results from the saliency analysis are shown in Fig. 3A. For each channel, we display the PSO-specific relevance scores by encoding the maximum magnitude of the influence in the size of the circles (bigger circles mean stronger influence on the predictions), and the temporal occurrence of that maximum in the respective color coding (lighter electrodes have their maximal influence on the PSO earlier). The color bar at the bottom limits the temporal influence to −400 ms prior to PSO, consistent with previous reports about speech planning 39 and articulatory representations 19. The saliency analysis showed that the nVAD model relied on a broad network of electrodes covering motor, premotor and somatosensory cortices whose collective changes in the high-gamma activity were relevant for identifying speech. Meanwhile, voice activity information encoded in the dorsal laryngeal area (highlighted electrodes in the upper grid in Fig. 3A) 19 only mildly contributed to the PSO.
Figure 3C shows relevance scores over a time period of 1 s prior to PSO for 3 selected electrodes that strongly contributed to predicting speech onsets. In conjunction with the color coding from Fig. 3A, the temporal associations were consistent with previous studies that examined phoneme decoding over fixed window sizes of 400 ms 18 and 500 ms 40,41 around speech onset times, suggesting that the nVAD model benefited from neural activity during speech planning and phonological processing 39 when identifying speech onset. We hypothesize that the decline in the relevance scores after −200 ms can be explained by the fact that voice activity information might have already been stored in the long short-term memory of the nVAD model and thus changes in neural activity beyond this time had less influence on the prediction.

Discussion
Here we demonstrate the feasibility of a closed-loop BCI that is capable of online synthesis of intelligible words using intracranial recordings from the speech cortex of an ALS clinical trial participant. Recent studies 10,11,13,27 suggest that deep learning techniques are a viable tool to reconstruct acoustic speech from ECoG signals. We found an approach consisting of three consecutive RNN architectures that identify and transform neural speech correlates into an acoustic waveform that can be streamed over the loudspeaker as neurofeedback, resulting in an 80% intelligibility score on a closed-vocabulary, keyword reading task.
The majority of human listeners were able to correctly recognize most synthesized words. All words from the closed vocabulary were chosen for a prior study 28 that explored speech decoding for intuitive control of a communication board rather than being constructed to elicit discriminable neural activity that benefits decoder performance. The listening tests suggest that the words "Left" and "Back" were responsible for the majority of misclassified words. These words share very similar articulatory features, and our participant's speech impairments likely made these words less discriminable in the synthesis process.
Saliency analysis showed that our nVAD approach used information encoded in the high-gamma band across predominantly motor, premotor and somatosensory cortices, while electrodes covering the dorsal laryngeal area only marginally contributed to the identification of speech onsets. In particular, neural changes previously reported to be important for speech planning and phonological processing 19,39 appeared to have a profound impact. Here, the analysis indicates that our nVAD model learned a proper representation of spoken speech processes, providing a connection between neural patterns learned by the model and the spatio-temporal dynamics of speech production.
Our participant was implanted with 128 subdural ECoG electrodes, roughly half of which covered cortical areas where similar high-gamma responses have been reliably elicited during overt speech 18,19,40,42 and have been used for offline decoding and reconstruction of speech 10,11. This study and others like it 24,27,43,44 explored the potential of ECoG-based BCIs to augment communication for individuals with motor speech impairments due to a variety of neurological disorders, including ALS and brainstem stroke. A potential advantage of ECoG for BCI is the stability of signal quality over long periods of time 45. In a previous study of an individual with locked-in syndrome due to ALS, a fully implantable ECoG BCI with fewer electrodes provided a stable switch for a spelling application over a period of more than 3 years 46. Similarly, Rao et al. reported robust responses for ECoG recordings over the speech-auditory cortex for two drug-resistant epilepsy patients over a period of 1.5 years 47. More recently, we showed that the same clinical trial participant could control a communication board with ECoG decoding of self-paced speech commands over a period of 3 months without retraining or recalibration 28. The speech synthesis approach we demonstrated here used training data from five and a half months prior to testing and produced similar results over 3 separate days of testing, with recalibration but no retraining in each session. These findings suggest that the correspondence between neural activity in ventral sensorimotor cortex and speech acoustics was not significantly changed over this time period. Although longitudinal testing over longer time periods will be needed to explicitly test this, our findings provide additional support for the stability of ECoG as a BCI signal source for speech synthesis.
Our approach used a speech synthesis model trained on neural data acquired during overt speech production. This constrains our current approach to patients with speech motor impairments in which vocalization is still possible and in which speech may still be intelligible. Given the increasing use of voice banking among people living with ALS, it may also be possible to improve the intelligibility of synthetic speech using an approach similar to ours, even in participants with unintelligible or absent speech. This speech could be utilized as a surrogate but would require careful alignment to speech attempts. Likewise, the same approach could be used with a generic voice, though this would not preserve the individual's speech characteristics. Here our results were achieved without the added challenge of absent ground truth, but they serve as an important demonstration that if adequate alignment is achieved, direct synthesis of acoustic speech from ECoG is feasible, accurate, and stable, even in a person with dysarthria due to ALS. Nevertheless, it remains to be seen how long our approach will continue to produce intelligible speech as our patient's neural responses and articulatory impairments change over time due to ALS. Previous studies of long-term ECoG signal stability and BCI performance in patients with more severe motor impairments suggest that this may be possible 3,48.
Although our approach allowed for online, closed-loop production of synthetic speech that preserved our participant's individual voice characteristics, the bidirectional LSTM imposed a delay in the audible feedback until after the patient spoke each word. We considered this delay to be not only acceptable, but potentially desirable, given our patient's speech impairments and the likelihood of these impairments worsening in the future due to ALS. Although normal speakers use immediate acoustic feedback to tune their speech motor output 49, individuals with progressive motor speech impairments are likely to reach a point at which there is a significant, and distracting, mismatch between the subject's speech and the synthetic speech produced by the BCI. In contrast, providing acoustic feedback immediately after each utterance gives the user clear and uninterrupted output that they can use to improve subsequent speech attempts, if necessary.
While our results are promising, the approach used here did not allow for synthesis of unseen words. The bidirectional architecture of the decoding model learned variations of the neural dynamics of each word and was capable of recovering their acoustic representations from corresponding sequences of high-gamma frames. This approach did not capture more fine-grained and isolated part-of-speech units, such as syllables or phonemes. However, previous research 11,27 has shown that speech synthesis approaches based on bidirectional architectures can generalize to unseen elements that were not part of the training set. Future research will be needed to expand the limited vocabulary used here, and to explore to what extent similar or different approaches are able to extrapolate to words that are not in the vocabulary of the training set.
Our demonstration here builds on previous seminal studies of the cortical representations for articulation and phonation 19,32,40 in epilepsy patients implanted with similar subdural ECoG arrays for less than 30 days. These studies and others using intraoperative recordings have also supported the feasibility of producing synthetic speech from ECoG high-gamma responses 10,11,33, but these demonstrations were based on offline analysis of ECoG signals that were previously recorded in subjects with normal speech, with the exception of the work by Metzger et al. 27. Here, a participant with impaired articulation and phonation was able to use a chronically implanted investigational device to produce acoustic speech that retained his unique voice characteristics. This was made possible through online decoding of ECoG high-gamma responses, using an algorithm trained on data collected months before. Notwithstanding the current limitations of our approach, our findings here provide a promising proof-of-concept that ECoG BCIs utilizing online speech synthesis can serve as alternative and augmentative communication devices for people living with ALS. Moreover, our findings should motivate continued research on the feasibility of using BCIs to preserve or restore vocal communication in clinical populations where this is needed.

Participant
Our participant was a male native English speaker in his 60s with ALS who was enrolled in a clinical trial (NCT03567213), approved by the Johns Hopkins University Institutional Review Board (IRB) and by the FDA (under an investigational device exemption) to test the safety and preliminary efficacy of a brain-computer interface composed of subdural electrodes and a percutaneous connection to external EEG amplifiers and computers. All experiments conducted in this study complied with all relevant guidelines and regulations, and were performed according to a clinical trial protocol approved by the Johns Hopkins IRB. Diagnosed with ALS 8 years prior to implantation, our participant's motor impairments had chiefly affected bulbar and upper extremity muscles and had resulted in motor impairments sufficient to render continuous speech mostly unintelligible (though individual words were intelligible), and to require assistance with most activities of daily living. Our participant's ability to carry out activities of daily living was assessed using the ALSFRS-R measure 50, resulting in a score of 26 out of 48 possible points (speech was rated at 1 point, see Supplementary Data S5). Furthermore, speech intelligibility and speaking rate were evaluated by a certified speech-language pathologist, whose detailed assessment may be found in the Supplementary Note. The participant gave informed consent after being counseled about the nature of the research and implant-related risks and was implanted with the study device in July 2022. Additionally, the participant gave informed consent for use of his audio and video recordings in publications of the study results.

Study device and implantation
The study device was composed of two 8 × 8 subdural electrode grids (PMT Corporation, Chanhassen, MN) connected to a percutaneous 128-channel Neuroport pedestal (Blackrock Neurotech, Salt Lake City, UT).
The study device was surgically implanted during a standard awake craniotomy with a combination of local anesthesia and light sedation, without neuromuscular blockade. The device's ECoG grids were placed on the pial surface of sensorimotor representations for speech and upper extremity movements in the left hemisphere. Careful attention was paid to ensure that the scalp flap was well away from the external pedestal. Cortical representations were targeted using anatomical landmarks from pre-operative structural (MRI) and functional imaging (fMRI), in addition to somatosensory evoked potentials measured intraoperatively. Two reference wires attached to the Neuroport pedestal were implanted in the subdural space on the outward facing surface of the subdural grids. The participant was awoken during the craniotomy to confirm proper functioning of the study device and final placement of the two subdural grids. For this purpose, the participant was asked to repeatedly speak a single word as event-related ECoG spectral responses were noted to verify optimal placement for the implanted electrodes. On the same day, the participant had a post-operative CT which was then co-registered to a pre-operative MRI to verify the anatomical locations of the two grids.

Data recording
During all training and testing sessions, the Neuroport pedestal was connected to a 128-channel NeuroPlex-E headstage that was in turn connected by a mini-HDMI cable to a NeuroPort Biopotential Signal Processor (Blackrock Neurotech, Salt Lake City, UT, USA) and external computers. We acquired neural signals at a sampling rate of 1000 Hz.
Acoustic speech was recorded through an external microphone (BETA® 58A, SHURE, Niles, IL) in a room isolated from external acoustic and electronic noise, then amplified and digitized by an external audio interface (H6-audio-recorder, Zoom Corporation, Tokyo, Japan).The acoustic speech signal was split and forwarded to: (1) an analog input of the NeuroPort Biopotential Signal Processor (NSP) to be recorded at the same frequency and in synchrony with the neural signals, and (2) the testing computer to capture high-quality (48 kHz) recordings.We applied cross-correlation to align the high-quality recordings with the synchronized audio signal from the NSP.
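The cross-correlation alignment can be sketched as follows. This is a minimal illustration, assuming both audio streams have already been brought to the same sampling rate; the function name is hypothetical and the sketch does not reproduce the study's alignment code.

```python
import numpy as np


def estimate_lag(reference: np.ndarray, delayed: np.ndarray) -> int:
    """Return the number of samples by which `delayed` lags `reference`.

    A positive result means that `delayed` starts later than `reference`.
    Both signals are assumed to share one sampling rate.
    """
    ref = reference - reference.mean()
    sig = delayed - delayed.mean()
    # Full cross-correlation; index i corresponds to lag i - (len(ref) - 1).
    xcorr = np.correlate(sig, ref, mode="full")
    return int(np.argmax(xcorr)) - (len(ref) - 1)
```

Once the lag is known, the high-quality recording can be shifted by that many samples so that it lines up with the audio channel recorded in synchrony with the neural signals.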

Experiment recordings and task design
Each recording day began with a syllable repetition task to acquire cortical activity to be used for baseline normalization. Each syllable was audibly presented through a loudspeaker, and the participant was instructed to recite the heard stimulus by repeating it aloud. Stimulus presentation lasted for 1 s, and trial duration was set randomly between 2.5 s and 3.5 s with a step size of 80 ms. In the syllable repetition task, the participant was instructed to repeat 12 consonant-vowel syllables (Supplementary Table S4), in which each syllable was repeated 5 times. We extracted high-gamma frames from all trials to compute for each day the mean and standard deviation statistics for channel-specific normalization.
To collect data for training our nVAD and speech decoding model, we recorded ECoG during multiple blocks of a speech production task over a period of 6 weeks. During the task, the participant read aloud single words that were prompted on a computer screen, interrupted occasionally by a silence trial in which the participant was instructed to say nothing. The words came from a closed vocabulary of 6 words ("Left", "Right", "Up", "Down", "Enter", "Back", and "…" for silence) that were chosen for a separate study in which these spoken words were decoded from ECoG to control a communication board 28. In each block, there were ten repetitions of each word (60 words in total) that appeared in a pseudo-randomized order by having a fixed set of seeds to control randomization orders. Each word was shown for 2 s per trial with an intertrial interval of 3 s. The participant was instructed to read the prompted word aloud as soon as it appeared. Because his speech was slow, effortful, and dysarthric, the participant may have sometimes used some of the intertrial interval to complete word production. However, offline analysis verified at least 1 s between the end of each spoken word and the beginning of the next trial, assuring that enough time had passed to avoid ECoG high-gamma responses leaking into subsequent trials. In each block, neural signals and audibly vocalized speech were acquired in parallel and stored to disc using BCI2000 51.
We recorded training, validation, and test data for 10 days, and deployed our approach for synthesizing speech online five and a half months later. During the online task, the synthesized output was played to the participant while he performed the same keyword reading task as in the training sessions. The feedback from each synthesized word began after he spoke the same word, avoiding any interference with production from the acoustic feedback. The validation dataset was used for finding appropriate hyperparameters to train both nVAD and the decoding model. The test set was used to validate final model generalizability before online sessions. We also used the test set for the saliency analysis. In total, the training set comprised 1570 trials that aggregated to approximately 80 min of data (21.8 min are pure speech), while the validation and test set contained 70 trials each with around 3 min of data (0.9 min pure speech). The data in each of these datasets were collected on different days, so that no baseline or other statistics in the training set leaked into the validation or test set.

Signal processing and feature extraction
Neural signals were transformed into broadband high-gamma power features that have been previously reported to closely track the timing and location of cortical activation during speech and language processes 42,52. In this feature extraction process, we first re-referenced all channels within each 64-contact grid to a common-average reference (CAR filtering), excluding channels with poor signal quality in any training session. Next, we selected all channels that had previously shown significant high-gamma responses during the syllable repetition task described above. This included 64 channels (Supplementary Fig. S2, channels with blue outlines) across motor, premotor and somatosensory cortices, including the dorsal laryngeal area. From here, we applied two IIR Butterworth filters (both with filter order 8) to extract the high-gamma band in the range of 70 to 170 Hz while subsequently attenuating the first harmonic (118-122 Hz) of the line noise. For each channel, we computed logarithmic power features based on windows with a fixed length of 50 ms and a frameshift of 10 ms. To estimate speech-related increases in broadband high-gamma power, we normalized each feature by the day-specific statistics of the high-gamma power features accumulated from the syllable repetition task.
For the acoustic recordings of the participant's speech, we downsampled the time-aligned high-quality microphone recordings from 48 to 16 kHz. From here, we padded the acoustic data by 16 ms to account for the shift introduced by the two filters on the neural data and estimated the boundaries of speech segments using an energy-based voice activity detection algorithm 53. Likewise, we computed acoustic features in the LPC coefficient space through the encoding functionality of the LPCNet vocoder. Both voice activity detection and LPC feature encoding were configured to operate on 10 ms frameshifts to match the number of samples from the broadband high-gamma feature extraction pipeline.
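The filtering, framing and normalization steps can be sketched with NumPy and SciPy. The sketch rests on several assumptions that are not taken from the study: power is computed as the mean of squared samples per window, a small constant guards the logarithm, the filters are applied causally as second-order sections, and the order argument 8 is passed directly to `scipy.signal.butter` (which doubles the effective order for band filters). The input is assumed to be re-referenced already, and all function names are hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 1000   # sampling rate in Hz
WIN = 50    # 50 ms window = 50 samples at 1 kHz
SHIFT = 10  # 10 ms frame shift = 10 samples at 1 kHz


def high_gamma_log_power(ecog: np.ndarray) -> np.ndarray:
    """Map (n_samples, n_channels) ECoG onto (n_frames, n_channels) features."""
    # Band-pass for the high-gamma range, band-stop for the 120 Hz harmonic.
    band = butter(8, [70, 170], btype="bandpass", fs=FS, output="sos")
    stop = butter(8, [118, 122], btype="bandstop", fs=FS, output="sos")
    x = sosfilt(stop, sosfilt(band, ecog, axis=0), axis=0)

    # Slice into overlapping windows: shape (n_frames, WIN, n_channels).
    n_frames = 1 + (x.shape[0] - WIN) // SHIFT
    frames = np.stack([x[i * SHIFT: i * SHIFT + WIN] for i in range(n_frames)])

    # Logarithmic power per window and channel (assumed: mean of squares).
    return np.log(np.mean(frames ** 2, axis=1) + 1e-12)


def normalize(features: np.ndarray, baseline: np.ndarray) -> np.ndarray:
    """Z-score features with channel-wise statistics from a baseline task."""
    return (features - baseline.mean(axis=0)) / baseline.std(axis=0)
```

With these settings, one second of data (1000 samples) yields 96 frames per channel, and `baseline` would hold the features extracted from that day's syllable repetition task.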

Network architectures
Our proposed approach relied on three recurrent neural network architectures: (1) a unidirectional model that identified speech segments from the neural data, (2) a bidirectional model that translated sequences of speech-related high-gamma activity into corresponding sequences of LPC coefficients representing acoustic information, and (3) LPCNet 36, which converted those LPC coefficients into an acoustic speech signal.
The network architecture of the unidirectional nVAD model was inspired by Zen et al. 54 in using a stack of two LSTM layers with 150 units each, followed by a linear fully connected output layer with two units representing speech or non-speech class target logits (Fig. 4). We trained the unidirectional nVAD model using truncated backpropagation through time (BPTT) 55 to keep the costs of single parameter updates manageable. We initialized this algorithm's hyperparameters k1 and k2 to 50 and 100 frames of high-gamma activity, respectively, such that the unfolding procedure of the backpropagation step was limited to 100 frames (1 s) and repeated every 50 frames (500 ms). Dropout was used as a regularization method with a probability of 50% to counter overfitting effects 56. Comparison between predicted and target labels was determined by the cross-entropy loss. We limited the network training using an early stopping mechanism that evaluated after each epoch the network performance on a held-out validation set and kept track of the best model weights by storing the model weights only when the frame-wise accuracy score was higher than before. The learning rate of the stochastic gradient descent optimizer was dynamically adjusted in accordance with the RMSprop formula 57 with an initial learning rate of 0.001. Using this procedure, the unidirectional nVAD model was trained for 27,975 update steps, achieving a frame-wise accuracy of 93.4% on held-out validation data. The architecture of the nVAD model had 311,102 trainable weights.
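One way to read the two truncated-BPTT hyperparameters is as a schedule of frame ranges: every 50 frames a parameter update is triggered, and gradients are unrolled over at most the preceding 100 frames. The sketch below only enumerates those ranges. It is an illustration of the schedule, not the training loop used in the study, and the function name is hypothetical.

```python
from typing import Iterator, Tuple


def tbptt_windows(n_frames: int, k1: int = 50, k2: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) frame ranges over which gradients are unrolled.

    An update is triggered every `k1` frames, and each update backpropagates
    through at most the last `k2` frames (stop is exclusive).
    """
    for stop in range(k1, n_frames + 1, k1):
        yield max(0, stop - k2), stop
```

For a sequence of 250 frames this gives the ranges (0, 50), (0, 100), (50, 150), (100, 200) and (150, 250), so consecutive updates overlap by half a second of neural data.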
The network architecture of the bidirectional decoding model had a very similar configuration to the unidirectional nVAD but employed a stack of bidirectional LSTM layers for sequence modelling 11 to include past and future contexts. Since the acoustic space of the LPC components was continuous, we used a linear fully connected output layer for this regression task. Figure 4 contains an illustration of the network architecture of the decoding model. In contrast to the unidirectional nVAD model, we used standard BPTT to account for both past and future contexts within each extracted segment identified as spoken speech. The architecture of the decoding model had 378,420 trainable weights and was trained for 14,130 update steps using a stochastic gradient descent optimizer. The initial learning rate was set to 0.001 and dynamically updated in accordance with the RMSProp formula. Again, we used dropout with a 50% probability and employed an early stopping mechanism that only updated model weights when the loss on the held-out validation set was lower than before. Both the unidirectional nVAD and the bidirectional decoding model were implemented within the PyTorch framework. For LPCNet, we used the C-implementation and pretrained model weights by the original authors and communicated with the library via wrapper functions through the Cython programming language.
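The reported weight counts can be checked with simple arithmetic. The sketch below assumes 64 input channels, 20 output values for the decoder (18 cepstral coefficients plus 2 pitch parameters), and the PyTorch LSTM parameterization with two bias vectors per gate. The decoder's hidden size of 100 units per direction and its depth of two layers are not stated in the text; they are inferred here only because they reproduce the reported total, and other configurations may exist.

```python
def lstm_layer_params(input_size: int, hidden: int, bidirectional: bool = False) -> int:
    """Weights of one LSTM layer with 4 gates and two bias vectors per gate."""
    per_direction = 4 * (input_size * hidden + hidden * hidden + 2 * hidden)
    return per_direction * (2 if bidirectional else 1)


def model_params(n_inputs: int, hidden: int, n_layers: int, n_outputs: int,
                 bidirectional: bool = False) -> int:
    """Weights of a stack of LSTM layers followed by one linear output layer."""
    directions = 2 if bidirectional else 1
    total, in_size = 0, n_inputs
    for _ in range(n_layers):
        total += lstm_layer_params(in_size, hidden, bidirectional)
        in_size = hidden * directions  # next layer sees all directions
    return total + in_size * n_outputs + n_outputs  # linear weights + biases


# nVAD: two unidirectional layers with 150 units and 2 output logits.
nvad_weights = model_params(64, 150, 2, 2)
# Decoder: hidden size 100 and two layers are inferred assumptions.
decoder_weights = model_params(64, 100, 2, 20, bidirectional=True)
print(nvad_weights, decoder_weights)  # 311102 378420
```

Both totals agree with the counts given in the text, which supports the reading that the 64 selected channels form the input of both models.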

Closed-loop architecture
Our closed-loop architecture was built upon ezmsg, a general-purpose framework which enables the implementation of streaming systems in the form a directed acyclic network of connected units, which communicate with each other through a publish/subscribe software engineering pattern using asynchronous coroutines.Here, each unit represents a self-contained operation which receives many inputs, and optionally propagates its output to all its subscribers.A unit consists of a settings and state class for enabling initial and updatable configurations and has multiple input and output connection streams to communicate with other nodes in the network.Figure 4 shows a schematic overview of the closed-loop architecture.ECoG signals were received by connecting to BCI2000 via a custom ZeroMQ (ZMQ) networking interface that sent packages of 40 ms over the TCP/IP protocol.From here, each unit interacted with other units through an asynchronous message system that was implemented on top of a shared-memory publish-subscribe multi-processing pattern.Figure 4 shows that the closed-loop architecture was comprised of 5 units for the synthesis pipeline, while employing several additional units that acted as loggers and wrote intermediate data to disc.
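The publish/subscribe pattern with asynchronous coroutines can be illustrated with the Python standard library alone. The sketch below deliberately does not use the ezmsg API: the `Unit` class, the `None` end-of-stream sentinel and the toy processing function are all hypothetical stand-ins that run in a single process, without the shared-memory transport described above.

```python
import asyncio
from typing import Any, Callable, List, Optional


class Unit:
    """A self-contained operation that publishes results to its subscribers."""

    def __init__(self, fn: Callable[[Any], Optional[Any]]) -> None:
        self.fn = fn
        self.inbox: asyncio.Queue = asyncio.Queue()
        self.subscribers: List["Unit"] = []

    def subscribe(self, other: "Unit") -> None:
        self.subscribers.append(other)

    async def run(self) -> None:
        while True:
            msg = await self.inbox.get()
            if msg is None:  # sentinel: pass it on and stop
                for sub in self.subscribers:
                    await sub.inbox.put(None)
                return
            out = self.fn(msg)
            if out is not None:  # publishing is optional
                for sub in self.subscribers:
                    await sub.inbox.put(out)


async def demo(packets: List[int]):
    played: List[int] = []
    logged: List[int] = []
    process = Unit(lambda x: x * 2)   # stand-in for a processing step
    playback = Unit(played.append)    # sink: consumes, publishes nothing
    logger = Unit(logged.append)      # second subscriber acting as a logger
    process.subscribe(playback)
    process.subscribe(logger)
    tasks = [asyncio.create_task(u.run()) for u in (process, playback, logger)]
    for packet in packets:
        await process.inbox.put(packet)
    await process.inbox.put(None)
    await asyncio.gather(*tasks)
    return played, logged
```

Calling `asyncio.run(demo([1, 2, 3]))` delivers every processed packet to both subscribers, which mirrors how logger units can tap intermediate data without affecting the synthesis path.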
In order to play back the synthesized speech during closed-loop sessions, we wrote the bytes of the raw PCM waveform to standard output (stdout) and reinterpreted them by piping them into SoX. We implemented our closed-loop architecture in Python 3.10. To keep the computational complexity manageable for this streamlined application, we implemented several functionalities, such as ring buffers or specific calculations in the high-gamma feature extraction, in Cython.
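Writing raw PCM bytes to a binary stream for playback through SoX could look roughly as follows. This is a sketch under stated assumptions: the sampling rate of 16 kHz, the signed 16-bit little-endian mono sample format, and the helper names `float_to_pcm16`, `write_pcm` and `sine` are illustrative choices, none of which are specified in the text.

```python
import math
import struct

SAMPLE_RATE = 16000  # assumed sampling rate, not stated in the text

def float_to_pcm16(samples):
    """Convert floats in [-1, 1] to little-endian signed 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)

def write_pcm(stream, samples):
    """Write a waveform to a binary stream and flush so playback is not delayed."""
    stream.write(float_to_pcm16(samples))
    stream.flush()

def sine(freq_hz, duration_s):
    """Test tone used here in place of vocoder output."""
    n = int(SAMPLE_RATE * duration_s)
    return [0.5 * math.sin(2.0 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i in range(n)]
```

In a script, `write_pcm(sys.stdout.buffer, sine(440, 0.5))` would emit the waveform on stdout. Under the assumed format, it could then be piped into SoX with a command along the lines of `python synth.py | play -t raw -r 16000 -e signed-integer -b 16 -c 1 -`, where `synth.py` is a hypothetical script name.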

Contamination analysis
Overt speech production can cause acoustic artifacts in electrophysiological recordings, allowing learning machines such as neural networks to rely on information that is likely to fail once deployed, a phenomenon widely known as Clever Hans 58 . We used the method proposed by Roussel et al. 59 to assess the risk that our ECoG recordings had been contaminated. This method compares correlations between neural and acoustic spectrograms to determine a contamination index which describes the average correlation of matching frequencies. This contamination index is compared to the distribution of contamination indices resulting from randomly permuting the rows and columns of the contamination matrix, allowing statistical analysis of the risk when assuming that no acoustic contamination is present.
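The idea of comparing the contamination index against a permutation distribution can be sketched in a few lines. This is a simplified illustration and not the published implementation of Roussel et al.: the function names are hypothetical, spectrograms are represented as plain lists (one list of time frames per frequency bin), and the p-value is a one-sided empirical estimate chosen for this example.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def contamination_matrix(neural_spec, audio_spec):
    """Correlate every neural frequency bin with every audio frequency bin."""
    return [[pearson(n_bin, a_bin) for a_bin in audio_spec]
            for n_bin in neural_spec]

def contamination_index(matrix):
    """Average correlation of matching frequencies (the matrix diagonal)."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)

def permutation_p_value(matrix, n_perm=1000, seed=0):
    """Empirical p-value of the observed index under shuffled rows and columns."""
    rng = random.Random(seed)
    k = len(matrix)
    observed = contamination_index(matrix)
    at_least_as_large = 0
    for _ in range(n_perm):
        rows = rng.sample(range(k), k)
        cols = rng.sample(range(k), k)
        # Diagonal mean of the matrix after permuting rows and columns.
        permuted = sum(matrix[r][c] for r, c in zip(rows, cols)) / k
        if permuted >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)
```

A matrix whose large correlations are concentrated on the diagonal yields a small p-value, which would flag likely contamination. A matrix without diagonal structure yields an index that is indistinguishable from its permutations.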
For each recording day among the train, test and validation sets, we analyzed acoustic contamination in the high-gamma frequency range. We identified 1 channel (Channel 46) in our recordings that was likely contaminated during 3 recording days (D 5 , D 6 , and D 7 ), and we corrected this channel by taking the average of high-gamma power features from neighboring channels (8-neighbour configuration, excluding the bad channel 38). A detailed report can be found in Supplementary Fig. S1, where each histogram corresponds to the distribution of permuted contamination matrices, and colored vertical bars indicate the actual contamination index, where green and red indicate the statistical criterion threshold (green: p > 0.05, red: p ≤ 0.05). After excluding the neural data from channel 46, Roussel's method suggested that the null hypothesis of no acoustic contamination could not be rejected, and thus we concluded that no acoustic speech had interfered with the neural recordings.
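The neighbor-averaging correction can be sketched as follows. The channel layout is an assumption made only for this illustration: channels are taken to be numbered 1 to 64 in row-major order on a single 8 × 8 grid, which is not confirmed by the text, and the function names `neighbours` and `interpolate_channel` are hypothetical.

```python
GRID = 8  # assumed 8 x 8 grid with 1-indexed, row-major channel numbering

def neighbours(channel, bad=()):
    """Return the 8-connected neighbours of a channel, skipping bad channels."""
    row, col = divmod(channel - 1, GRID)
    found = []
    for d_row in (-1, 0, 1):
        for d_col in (-1, 0, 1):
            if d_row == 0 and d_col == 0:
                continue  # skip the channel itself
            r, c = row + d_row, col + d_col
            if 0 <= r < GRID and 0 <= c < GRID:
                candidate = r * GRID + c + 1
                if candidate not in bad:
                    found.append(candidate)
    return found

def interpolate_channel(frame, channel, bad=()):
    """Replace one channel's high-gamma power by the mean of its neighbours.

    frame maps channel number to a high-gamma power value for one time step.
    """
    usable = neighbours(channel, bad)
    fixed = dict(frame)
    fixed[channel] = sum(frame[ch] for ch in usable) / len(usable)
    return fixed
```

Under the assumed numbering, `neighbours(46, bad=(38,))` returns seven channels, because channel 38 lies directly adjacent to channel 46 and is excluded from the average.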

Listening test
We conducted a forced-choice listening test similar to Herff et al. 14 in which 21 native English speakers evaluated the intelligibility of the synthesized output and the originally spoken words. Listeners were asked to listen to one word at a time and select which word out of the six options most closely resembled it. Here, the listeners had the opportunity to listen to each sample many times before submitting a choice. We implemented the listening test on top of the BeaqleJS framework 60 . All words that were either spoken or synthesized during the 3 closed-loop sessions were included in the listening test, but were randomly sampled from a uniform distribution for unique randomized sequences across listeners. Supplementary Fig. S3 provides a screenshot of the interface with which the listeners were working.
All human listeners were recruited only through indirect means such as IRB-approved flyers placed on campus sites and had no direct connection to the PI. Anonymous demographic data were collected at the end of the listening test asking for age and preferred gender. Overall, recruited participants were 23.8% male and 61.9% female (14% other or preferred not to answer), ranging between 18 and 30 years old.
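Scoring a six-alternative forced-choice test and randomizing the presentation order per listener can be illustrated with a short sketch. The function names and the placeholder labels `w1` to `w6` are hypothetical; the actual keywords and the scoring code used for the study are not given in this section.

```python
import random
from collections import Counter

def listener_sequence(samples, listener_id):
    """Return a listener-specific random ordering of all samples."""
    rng = random.Random(listener_id)
    return rng.sample(samples, len(samples))

def score_forced_choice(responses, vocabulary):
    """Compute accuracy, chance level and a confusion table.

    responses is a list of (true_word, chosen_word) pairs.
    """
    confusion = {word: Counter() for word in vocabulary}
    correct = 0
    for truth, choice in responses:
        confusion[truth][choice] += 1
        if truth == choice:
            correct += 1
    accuracy = correct / len(responses)
    chance = 1.0 / len(vocabulary)  # uniform guessing among the options
    return accuracy, chance, confusion

vocabulary = ["w1", "w2", "w3", "w4", "w5", "w6"]  # placeholder labels
responses = [("w1", "w1"), ("w2", "w2"), ("w3", "w1"),
             ("w4", "w4"), ("w5", "w5")]
accuracy, chance, confusion = score_forced_choice(responses, vocabulary)
```

With six response options, uniform guessing gives a chance level of 1/6, or about 16.7%.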

Statistical analysis
Original and reconstructed speech spectrograms were compared using Pearson's correlation coefficients for 80 mel-scaled spectral bins. For this, we transformed original and reconstructed waveforms into the spectral domain using the short-time Fourier transform (window size: 50 ms, frameshift: 10 ms, window function: Hanning), applied 80 triangular filters to focus only on perceptual differences for human listeners 61 , and Gaussianized the distribution of the acoustic space using the natural logarithm. Pearson correlation scores were calculated for each sample by averaging the correlation coefficients across frequency bins. The 95% confidence interval (two-sided) was used in the feature selection procedure while the z-criterion was Bonferroni corrected across time points. Lower and upper bounds for all channels and time points can be found in the supplementary data. Contamination analysis is based on permutation tests that use t-tests as their statistical criterion with a Bonferroni

Figure 1. Overview of the closed-loop speech synthesizer. (A) Neural activity is acquired from a subset of 64 electrodes (highlighted in orange) from two 8 × 8 ECoG electrode arrays covering sensorimotor areas for face and tongue, and for upper limb regions. (B) The closed-loop speech synthesizer extracts high-gamma features to reveal speech-related neural correlates of attempted speech production and propagates each frame to a neural voice activity detection (nVAD) model (C) that identifies and extracts speech segments (D). When the participant finishes speaking a word, the nVAD model forwards the high-gamma activity of the whole extracted sequence to a bidirectional decoding model (E) which estimates acoustic features (F) that can be transformed into an acoustic speech signal. (G) The synthesized speech is played back as acoustic feedback.

Figure 2. Evaluation of the synthesized words. (A) Visual example of time-aligned original and reconstructed acoustic speech waveforms and their spectral representations (B) for 6 words that were recorded during one of the closed-loop sessions. Speech spectrograms are shown between 100 and 8000 Hz with a logarithmic frequency range to emphasize formant frequencies. (C) The confusion matrix between human listeners and ground truth. (D) Distribution of accuracy scores from all who performed the listening test for the synthesized speech samples. Dashed line shows chance performance (16.7%).

Figure 3. Changes in high-gamma activity across motor, premotor and somatosensory cortices trigger detection of speech output. (A) Saliency analysis shows that changes in high-gamma activity predominantly from 300 to 100 ms prior to predicted speech onset (PSO) strongly influenced the nVAD model's decision. Electrodes covering motor, premotor and somatosensory cortices show the impact of model decisions, while electrodes covering the dorsal laryngeal area only modestly added information to the prediction. Grey electrodes were either not used, bad channels or had no notable contributions. (B) Illustration of the general procedure on how relevance scores were computed. For each time step t, relevance scores were computed by backpropagation through time across all previous high-gamma frames X_t. Predictions of 0 correspond to no-speech, while 1 represents speech frames. (C) Temporal progression of mean magnitudes of the absolute relevance score in 3 selected channels that strongly contributed to PSOs. Shaded areas reflect the standard error of the mean (N = 60). Units of the relevance scores are in 10^-3.

Figure 4. System overview of the closed-loop architecture. The computational graph is designed as a directed acyclic network. Solid shapes represent ezmsg units, dotted ones represent initialization parameters. Each unit is responsible for a self-contained task and distributes its output to all its subscribers. Logger units run in separate processes so as not to interrupt the main processing chain for synthesizing speech.
subdural grids contained platinum-iridium disc electrodes (0.76 mm thickness, 2-mm diameter exposed surface) with 4 mm center-to-center spacing and a total surface area of 12.11 cm² (36.6 mm × 33.1 mm).